Simulation + Linear Regression

Tuesday, May 28

Today we will…

  • New Material
    • Simulating Distributions
  • Work Time
    • PA 9.2: Instrument Con
    • PC4: Linear Regression

Statistical Distributions

Recall from your statistics classes…

A random variable is a value we don’t know until we take a sample.

  • Coin flip: could be heads (0) or tails (1)
  • Person’s height: could be anything from 0 feet to 10 feet.
  • Annual income of a US worker: could be anything from $0 to $1.6 billion

The distribution of a random variable tells us its possible values and how likely they are.

  • Coin flip: a 50% chance each of heads and tails.
  • Heights follow a bell curve centered at 5 foot 7.
  • Most American workers make under $100,000.

Statistical Distributions with Names!

Uniform Distribution

  • When you know the range of values, but not much else.
  • All values in the range are equally likely to occur.

Normal Distribution

  • When you expect values to fall near the center.
  • Frequency of values follows a bell shaped curve.

t-Distribution

  • A slightly wider bell curve.
  • Used in the same contexts as the normal distribution, but more common with real data, where the population standard deviation is unknown.

Chi-Square Distribution

  • Somewhat skewed, and only allows values above zero.
  • Used in testing count data.

Binomial Distribution

  • Appears when you have two possible outcomes and you are counting how many times one of them occurred across a fixed number of trials.

How do we use distributions?

  • Find the probability of an event.
    • If I flip 10 coins, what are the chances I get all heads?
  • Estimate a proportion of a population.
    • About what proportion of people are above 6 feet tall?
  • Quantify the evidence in your data.
    • In my survey of 100 people, 67 said they were voting for Measure A. How confident am I that Measure A will pass?
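Questions like these map directly onto R's distribution functions, introduced next. As a quick sketch, the all-heads question has an exact answer:

```r
# Chance of all heads in 10 fair coin flips:
# P(10 heads) = dbinom(10, size = 10, prob = 0.5) = 0.5^10
p_all_heads <- dbinom(x = 10, size = 10, prob = 0.5)
p_all_heads   # about 0.000977 -- roughly 1 in 1024
```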

Distribution Functions in R

r is for random sampling.

  • Generate random values from a distribution.
  • We use this to simulate data (create pretend observations).
runif(n = 3, min = 10, max = 20)
[1] 14.75828 10.56014 13.54659
rnorm(n = 3)
[1] -0.7706457 -0.2787969  0.5921814
rnorm(n = 3, mean = -100, sd = 50)
[1]  -71.95251 -146.07839 -129.39804
rt(n = 3, df = 11)
[1] -0.528545  1.662651 -3.108201
rbinom(n = 3, size = 10, prob = 0.7)
[1] 5 6 7
rchisq(n = 3, df = 11)
[1] 19.05226 18.19772 14.85055

p is for probability.

  • Compute the chances of observing a value less than x.
  • We use this for calculating p-values.
pnorm(q = 1.5)
[1] 0.9331928
pnorm(q = 70, mean = 67, sd = 3)
[1] 0.8413447
1 - pnorm(q = 70, mean = 67, sd = 3)
[1] 0.1586553
pnorm(q = 70, mean = 67, sd = 3, lower.tail = FALSE)
[1] 0.1586553
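One way to see what pnorm() returns is to compare it against the empirical proportion from a large random sample (a quick sketch; the two should agree to within simulation noise):

```r
set.seed(123)
draws <- rnorm(100000)            # many draws from a standard normal
empirical   <- mean(draws < 1.5)  # proportion of draws below 1.5, near 0.933
theoretical <- pnorm(q = 1.5)     # 0.9331928
empirical
theoretical
```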

q is for quantile.

  • Given a probability \(p\), compute \(x\) such that \(P(X < x) = p\).
  • The q functions are “backwards” of the p functions.
qnorm(p = 0.95)
[1] 1.644854
qnorm(p = 0.95, mean = 67, sd = 3)
[1] 71.93456
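Because the q functions invert the p functions, composing them returns the original input:

```r
# qnorm() undoes pnorm(), and vice versa
pnorm(qnorm(p = 0.95))   # 0.95
qnorm(pnorm(q = 1.5))    # 1.5
```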

d is for density.

  • Compute the height of a distribution curve at a given \(x\).
  • For discrete dist: probability of getting exactly \(x\).
  • For continuous dist: the height is not a probability, so it is rarely useful on its own.

Probability of exactly 12 heads in 20 coin tosses, with a 70% chance of tails?

dbinom(x = 12, size = 20, prob = 0.3)
[1] 0.003859282
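A sanity check on dbinom(): the probabilities of all possible outcomes (0 through 20 heads) must sum to 1, and the p functions are running totals of the d functions:

```r
# All 21 possible head counts account for all the probability
sum(dbinom(x = 0:20, size = 20, prob = 0.3))    # 1
# pbinom() accumulates dbinom() values
pbinom(q = 12, size = 20, prob = 0.3)           # same as the sum below
sum(dbinom(x = 0:12, size = 20, prob = 0.3))
```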

Empirical vs Theoretical Distributions

Empirical: the observed data.

Code
data %>%
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10, color = "white") 

Theoretical: the distribution curve.

Code
data %>%
  ggplot(aes(x = height)) +
  stat_function(fun = ~dnorm(.x, mean = 67, sd = 3),
                col = "steelblue", lwd = 2) +
  stat_function(fun = ~dnorm(.x, mean = 67, sd = 2),
                col = "orange2", lwd = 2)

Plotting Both Distributions

data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, color = "white")

  1. Plot your data.
data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                color = "steelblue", lwd = 2)

  2. Add a density curve.
data |> 
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 10, color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                color = "steelblue", lwd = 2)

  3. Change the y-axis of the histogram to match the y-axis of the density curve.

Simulating Data

We can generate fake data based on the assumption that a variable follows a certain distribution.

  • We randomly sample observations from the distribution.
age <- runif(1000, min = 15, max = 75)

Since there is randomness involved, we will get a different result each time we run the code.

runif(3, min = 15, max = 75)
[1] 50.72983 66.67432 68.35707
runif(3, min = 15, max = 75)
[1] 67.11051 25.35165 35.57557

To make a reproducible random sample, we first set the seed:

set.seed(94301)
runif(3, min = 15, max = 75)
[1] 59.68333 30.27768 38.29962
set.seed(94301)
runif(3, min = 15, max = 75)
[1] 59.68333 30.27768 38.29962
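We can confirm programmatically that resetting the seed reproduces the exact same draws:

```r
set.seed(94301)
first_draw <- runif(3, min = 15, max = 75)
set.seed(94301)
second_draw <- runif(3, min = 15, max = 75)
identical(first_draw, second_draw)   # TRUE -- same seed, same values
```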
set.seed(435)
fake_data <- tibble(names   = charlatan::ch_name(1000),
        height  = rnorm(1000, mean = 67, sd = 3),
        age     = runif(1000, min = 15, max = 75),
        measure = rbinom(1000, size = 1, prob = 0.6)) |> 
  mutate(supports_measure_A = ifelse(measure == 1, "yes", "no"))
head(fake_data)
# A tibble: 6 × 5
  names                 height   age measure supports_measure_A
  <chr>                  <dbl> <dbl>   <int> <chr>             
1 Elbridge Kautzer        67.4  66.3       1 yes               
2 Brandon King            65.0  61.5       0 no                
3 Phyllis Thompson        68.1  53.8       1 yes               
4 Humberto Corwin         67.5  33.9       1 yes               
5 Theresia Koelpin        71.4  16.1       1 yes               
6 Hayden O'Reilly-Johns   66.2  37.0       0 no                

Check that the ages look uniformly distributed.

Code
fake_data |> 
  ggplot(aes( x = age,
             fill = supports_measure_A)) +
  geom_histogram(show.legend = F) +
  facet_wrap(~ supports_measure_A,
             ncol = 1) +
  scale_fill_brewer(palette = "Paired") +
  theme_bw() +
  labs(x = "Age (years)",
       y = "",
       subtitle = "Support for Measure A")

simulate + map()

Write a function to simulate height data for a population with some average and SD height.

ht_func <- function(n = 200, avg, std){
  tibble(person = 1:n,
         ht = rnorm(n = n, mean = avg, sd = std))
}

ht_func(n = 5, avg = 66, std = 3)
# A tibble: 5 × 2
  person    ht
   <int> <dbl>
1      1  62.5
2      2  64.9
3      3  71.2
4      4  67.0
5      5  69.1

Simulate multiple datasets with different average heights and SDs of height.

Code
fake_ht_data <- expand_grid(mean_ht = seq(from = 60, to = 78, by = 6),
                          std_ht  = c(3, 6)) |> 
  mutate(ht_data = pmap(list(avg = mean_ht,
                             std  = std_ht),
                        ht_func)) |>
  unnest(cols = c(ht_data))

fake_ht_data |> 
  head()
# A tibble: 6 × 4
  mean_ht std_ht person    ht
    <dbl>  <dbl>  <int> <dbl>
1      60      3      1  57.2
2      60      3      2  56.0
3      60      3      3  55.3
4      60      3      4  59.7
5      60      3      5  61.5
6      60      3      6  58.7
Code
fake_ht_data |> 
  ggplot(aes(x = ht)) +
  geom_histogram(color = "white") +
  facet_grid(std_ht ~ mean_ht)

sample()

Suppose you want to take a random sample of values:

my_vec <- c("dog", "cat", "bunny", "horse", "goat", "chicken")
sample(x = my_vec, size = 3)
[1] "cat"   "bunny" "dog"  
set.seed(1)
sample(x = my_vec, size = 5, replace = T)
[1] "dog"   "horse" "dog"   "cat"   "goat" 
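The replace argument controls whether values may repeat. A small sketch: without replacement (the default) a full-size sample is just a shuffle, while with replacement the sample size may exceed the vector length, which guarantees repeats here:

```r
my_vec <- c("dog", "cat", "bunny", "horse", "goat", "chicken")
# Without replacement: same six animals, no repeats
set.seed(2)
shuffled <- sample(x = my_vec, size = length(my_vec))
sort(shuffled)
# With replacement: 20 draws from 6 animals must contain duplicates
with_repeats <- sample(x = my_vec, size = 20, replace = TRUE)
anyDuplicated(with_repeats) > 0   # TRUE
```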

sample_n()

Suppose you want to take a random sample of observations from your data:

fake_sample <- fake_data |> 
  sample_n(size = 3)
fake_sample
# A tibble: 3 × 5
  names                    height   age measure supports_measure_A
  <chr>                     <dbl> <dbl>   <int> <chr>             
1 Dr. Rutherford Gutmann I   70.6  65.1       1 yes               
2 Vance Baumbach             70.1  34.3       1 yes               
3 Daron Macejkovic           61.7  33.9       0 no                

Simulation Example (Bdays)

Suppose there is a group of 50 people. Simulate the approximate probability that at least two people share a birthday (i.e., the same day of the year, ignoring the year of birth and leap years).

bDays <- function(n = 50){
  bday_data <- tibble(person = 1:n,
                      bday   = sample(1:365, size = n, replace = T))
  
  double_bdays <- bday_data |> 
    count(bday) |> 
    filter(n >= 2) |> 
    count() |> 
    pull(n)
  
  return(double_bdays > 0)
}

Simulation Example (Bdays)


sim_results <- map_lgl(.x = 1:1000,
                       .f = ~ bDays(n = 50))

sum(sim_results)/1000
[1] 0.967
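The simulated answer can be checked against the exact birthday-problem probability, the complement of "all 50 birthdays are distinct":

```r
# P(at least one shared birthday) = 1 - (365/365)(364/365)...(316/365)
p_exact <- 1 - prod((365 - 0:49) / 365)
p_exact   # about 0.970, close to the simulated 0.967
```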

In-line Code

We can automatically include code output in the written portion of a Quarto document using `r `.

  • This ensures reproducibility when you have results from a random generation process.
my_rand <- rnorm(1, mean = 0, sd = 1)
my_rand
[1] 0.5819984

Type this: My random number is `r my_rand`.

To get this: My random number is 0.5819984.

PA 9.2: Instrument Con

Tuesday, May 28

Today we will…

  • New Material
    • Review of Simple Linear Regression
    • Assessing Model Fit
  • PA 9.1: Mystery Animal
  • Lab 9: Baby Names

Some Data

NC Births Data

This dataset contains a random sample of 1,000 births from North Carolina in 2004 (sampled from a larger dataset).

  • Each case describes the birth of a single child, including characteristics of the:
    • child (birth weight, length of gestation, etc.).
    • birth parents (age, weight gained during pregnancy, smoking habits, etc.).

NC Births Data

library(openintro)
data(ncbirths)
slice_sample(ncbirths, n = 10) |> 
  knitr::kable() |> 
  kableExtra::scroll_box(height = "400px") |> 
  kableExtra::kable_styling(font_size = 30)
fage mage mature      weeks premie    visits marital     gained weight lowbirthweight gender habit     whitemom
  24   22 younger mom    39 full term      5 not married     10   7.19 not low        female nonsmoker white
  NA   20 younger mom    39 full term     16 not married     19   5.44 low            female nonsmoker white
  34   31 younger mom    39 full term     15 married         22   8.69 not low        male   nonsmoker white
  37   37 mature mom     41 full term     17 married         45   9.88 not low        male   nonsmoker white
  29   29 younger mom    44 full term     12 married         51   8.50 not low        male   nonsmoker white
  22   22 younger mom    40 full term     15 not married     31   7.06 not low        female nonsmoker white
  24   22 younger mom    39 full term     11 married         27   6.00 not low        male   smoker    white
  24   22 younger mom    39 full term     15 not married     45   7.50 not low        male   nonsmoker not white
  35   25 younger mom    37 full term     30 married         60   3.94 low            female nonsmoker white
  30   34 younger mom    40 full term     10 married         22   9.31 not low        male   nonsmoker white

Relationships Between Variables

Relationships Between Variables

In statistical models, we generally have one variable that is the response and one or more variables that are explanatory.

  • Response variable
    • Also: \(y\), dependent variable
    • This is the quantity we want to understand.
  • Explanatory variable
    • Also: \(x\), independent variable, predictor
    • This is something we think might be related to the response.

Visualizing a Relationship

The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”

Characterizing Relationships

  • Form: linear, quadratic, non-linear?
  • Direction: positive, negative?
  • Strength: how much scatter/noise?
  • Unusual observations: do points not fit the overall pattern?

Your turn!

How would you characterize this relationship?

  • Form?
  • Direction?
  • Strength?
  • Outliers?

Note: Much of what we are doing at this stage involves making judgment calls!

Fitting a Model

We often assume the value of our response variable is some function of our explanatory variable, plus some random noise.

\[response = f(explanatory) + noise\]

  • There is a mathematical function \(f\) that can translate values of one variable into values of another.
  • But there is some randomness in the process.

Simple Linear Regression (SLR)

If we assume the relationship between \(x\) and \(y\) takes the form of a linear function

\[ response = intercept + slope \times explanatory + noise \]

We use the following notation for this model:

Population Regression Model

\(Y = \beta_0 + \beta_1 X + \varepsilon\)

where \(\varepsilon \sim N(0, \sigma)\) is the random noise.

Fitted Regression Model

\(\hat{y}= \hat{\beta}_0 + \hat{\beta}_1 x\)

where   \(\hat{}\)   indicates the value was estimated.

Fitting an SLR Model

Regress baby birthweight (response variable) on the pregnant parent’s weight gain (explanatory variable).

  • We are assuming there is a linear relationship between how much weight the parent gains and how much the baby weighs at birth.

When visualizing data, fit a regression line (\(y\) on \(x\)) to your scatterplot.

ncbirths |> 
  ggplot(aes(x = gained, y = weight)) +
  geom_jitter() + 
  geom_smooth(method = "lm") 

The lm() function fits a linear regression model.

  • We use formula notation to specify the response variable (LHS) and the explanatory variable (RHS).
  • This code creates an lm object.
ncbirth_lm <- lm(weight ~ gained, 
                 data = ncbirths)
broom::tidy(ncbirth_lm)
# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   6.63     0.111       59.6  0         
2 gained        0.0161   0.00332      4.86 0.00000135
  • Intercept: expected mean of \(y\), when \(x\) is 0.
    • Someone gaining 0 lbs will have a baby weighing 6.63 lbs, on average.
  • Slope: expected change in the mean of \(y\), when \(x\) increases by 1 unit.
    • For each pound gained, the baby will weigh 0.016 lbs more, on average.

Residuals: the difference between observed (point) and expected (line).

ncbirth_lm$residuals |> 
  head(3)
         1          2          3 
 0.3866279  0.9271567 -0.6133721 
resid(ncbirth_lm) |> 
  head(3)
         1          2          3 
 0.3866279  0.9271567 -0.6133721 
broom::augment(ncbirth_lm) |> 
  head(3)
# A tibble: 3 × 9
  .rownames weight gained .fitted .resid    .hat .sigma   .cooksd .std.resid
  <chr>      <dbl>  <int>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
1 1           7.63     38    7.24  0.387 0.00133   1.47 0.0000458      0.262
2 2           7.88     20    6.95  0.927 0.00157   1.47 0.000311       0.630
3 3           6.63     38    7.24 -0.613 0.00133   1.47 0.000115      -0.416
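The columns above fit together as .resid = observed - .fitted, where .fitted comes from plugging \(x\) into the fitted equation. A sketch using the built-in mtcars data, so it runs without the openintro package:

```r
# fitted = intercept + slope * x, and residual = observed - fitted
m <- lm(mpg ~ wt, data = mtcars)
b <- coef(m)
all.equal(unname(fitted(m)),
          unname(b[1] + b[2] * mtcars$wt))   # TRUE
all.equal(unname(resid(m)),
          mtcars$mpg - unname(fitted(m)))    # TRUE
```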

Model Diagnostics

There are four conditions that must be met for a linear regression model to be appropriate:

  1. Linear relationship.
  2. Independent observations.
  3. Normally distributed residuals.
  4. Equal variance of residuals.

Model Diagnostics

Is the relationship linear?

  • Almost nothing will look perfectly linear.
  • Be careful with relationships that have curvature.
  • Try transforming your variables!

Are the observations independent?   Difficult to tell!

What does independence mean?

You should not be able to predict the \(y\) value of one observation from the \(y\) value of another observation.

Independence violations:

  • Stock market prices over time.
  • Geographical similarities.
  • Biological conditions of family members.
  • Repeated observations.

Are the residuals normally distributed?

This condition is less important than linearity or independence.

Do the residuals have equal (constant) variance?

  • The variability of points around the regression line is roughly constant.
  • Data with non-equal variance across the range of \(x\) can seriously mis-estimate the variability of the slope.
ncbirth_lm |> 
  augment() |> 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point()

Assessing Model Fit

Sum of Squared Errors (SSE)

  • This is calculated as the sum of the squared residuals.
  • A small SSE means small differences between observed and fitted values.
  • Also: Sum Sq of Residuals or deviance.
anova(ncbirth_lm)
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)    
gained      1   51.36  51.357  23.642 1.353e-06 ***
Residuals 971 2109.32   2.172                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
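The SSE can also be computed directly from the residuals; deviance() returns the same quantity. A sketch on the built-in mtcars data:

```r
# SSE by hand vs deviance() -- same number, two routes
m   <- lm(mpg ~ wt, data = mtcars)
sse <- sum(resid(m)^2)
all.equal(sse, deviance(m))   # TRUE
```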

Assessing Model Fit

Root Mean Square Error (RMSE)

  • The standard deviation of the residuals.
  • A small RMSE means small differences between observed and fitted values.
  • Also: Residual standard error or sigma.
summary(ncbirth_lm)

Call:
lm(formula = weight ~ gained, data = ncbirths)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4085 -0.6950  0.1643  0.9222  4.5158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.63003    0.11120  59.620  < 2e-16 ***
gained       0.01614    0.00332   4.862 1.35e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.474 on 971 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.02377,   Adjusted R-squared:  0.02276 
F-statistic: 23.64 on 1 and 971 DF,  p-value: 1.353e-06
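RMSE relates to SSE by dividing by the residual degrees of freedom and taking the square root; sigma() extracts it directly (a sketch on mtcars):

```r
# RMSE is the "Residual standard error" reported by summary()
m    <- lm(mpg ~ wt, data = mtcars)
rmse <- sqrt(sum(resid(m)^2) / df.residual(m))
all.equal(rmse, sigma(m))   # TRUE
```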

Assessing Model Fit

R-squared

  • The proportion of variability in the response accounted for by the linear model.
  • A large R-squared means the explanatory variable is good at explaining the response.
  • R-squared is between 0 and 1.
broom::glance(ncbirth_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0238        0.0228  1.47      23.6 0.00000135     1 -1757. 3520. 3535.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
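R-squared is one minus SSE over the total sum of squares around the mean of the response (a sketch on mtcars):

```r
# R^2 = 1 - SSE / SST
m   <- lm(mpg ~ wt, data = mtcars)
sse <- sum(resid(m)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(1 - sse / sst, summary(m)$r.squared)   # TRUE
```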

Model Comparison

Regress baby birthweight on…

… gestation weeks.

weight_weeks <- lm(weight ~ weeks, 
                   data = ncbirths)
  • SSE = 1246.55
  • RMSE = 1.119
  • \(R^2\) = 0.449

… number of doctor visits.

weight_visits <- lm(weight ~ visits, 
                    data = ncbirths)
  • SSE = 2152.74
  • RMSE = 1.475
  • \(R^2\) = 0.01819

Why does it make sense that the first model (gestation weeks) is better?

Multiple Linear Regression

When fitting a linear regression, you can include…

…multiple explanatory variables.

lm(y ~ x1 + x2 + x3 + ... + xn)

…interaction terms.

lm(y ~ x1 + x2 + x1:x2)

lm(y ~ x1*x2)

…quadratic relationships.

lm(y ~ I(x1^2) + x1)
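The * shorthand expands to both main effects plus their interaction, so the two interaction formulas above fit identical models. A sketch on mtcars, with wt and hp standing in for x1 and x2:

```r
# x1 * x2 is shorthand for x1 + x2 + x1:x2 -- same coefficients
m_star <- lm(mpg ~ wt * hp,          data = mtcars)
m_long <- lm(mpg ~ wt + hp + wt:hp,  data = mtcars)
all.equal(coef(m_star), coef(m_long))   # TRUE
```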

PA 9.1: Mystery Animal

Lab 9: Baby Names

To do…

  • PC3: Project Proposal + Data
    • Due TODAY, 5/28 at 11:59pm
  • PA 9.1: Mystery Animal
    • Due Wednesday, 5/29 at 10:00am
  • Lab 9: Baby Names
    • Due Saturday, 6/1 at 11:59pm
  • PC4: Linear Regression
    • Due Monday, 6/3 at 11:59pm

To do…

  • PA 9.2: Instrument Con
    • Due Saturday, 6/1 at 11:59pm.
  • Lab 9: Baby Names
    • Due Saturday, 6/1 at 11:59pm.
  • Read Chapter 10 and complete Check-in 10.1.
    • Due Monday, 6/3 at 10:00am.
  • PC4: Linear Regression
    • Due Monday, 6/3 at 11:59pm.